Abstract: Lattice reduction (LR) is a preprocessing technique
for multiple-input multiple-output (MIMO) symbol detection that
improves the bit error rate (BER) performance. In this paper,
we propose a customized homogeneous multiprocessor for LR.
The processor cores are based on the transport triggered
architecture (TTA). We propose modifications to the popular LR
algorithm, Lenstra-Lenstra-Lovász (LLL), for high throughput.
The TTA cores are programmed in a high-level language. Each
TTA core contains several special function units that accelerate
the program code. The multiprocessor takes 187 cycles to reduce
a single matrix. The architecture is synthesized in 90 nm
technology and takes 405 kgates at 210 MHz.

I. Introduction 

Multiple-input multiple-output (MIMO) is a key technology
for using the available radio spectrum efficiently. The basic
idea of MIMO is to send multiple independent data streams
from multiple antennas in the same frequency band. A MIMO
detector at the receiver must separate these independent
streams to identify the transmitted symbols. Maximum
likelihood (ML) detection is the optimal solution to the MIMO
detection problem: it compares the received signal with every
possible symbol vector of the constellation. However, the ML
algorithm is too complex for practical real-time
implementations. Linear detection algorithms are less complex
and therefore popular in practice, but they suffer from a
degraded bit error rate (BER) performance.

Lattice reduction (LR) is a preprocessing technique that can
be used with linear detection to significantly improve the
BER performance and reduce the gap between the traditional
linear detectors and the optimal ML detector. LR transforms
the MIMO channel matrix into a nearly orthogonal matrix, which
leads to a better BER performance. The most widely used LR
algorithm is the Lenstra-Lenstra-Lovász (LLL) algorithm, named
after its inventors [1]. The LLL algorithm poses many
challenges due to its nondeterministic execution time and high
computational complexity. We propose a modified LLL (MLLL)
algorithm that is based on the original LLL algorithm in the
complex domain. We use a fixed structure for the LLL based on
[?]. Instead of the Lovász condition, the less complex Siegel
condition is applied [?]. An early termination technique is
used as proposed in [?]. We demonstrate by Matlab simulation
that the BER performance loss of the hardware-friendly MLLL
algorithm is negligible.


Several hardware accelerators have been proposed in
[4], [5], [6], [7] for different LR algorithms. Fixed hardware
implementations provide a high data rate and consume less
silicon area than customized application-specific
instruction-set processors (ASIPs). The drawback of a fixed
hardware implementation is that it operates only on a fixed set
of parameters due to the hardwired control path, and the
control path cannot be modified later. An ASIP customized
for a small set of algorithms is an attractive solution in terms
of cost, silicon area and throughput. Most importantly, an
ASIP reduces the design risk, because its instruction memory
can be loaded with new programs or control instructions. The
control instructions can be easily obtained with a retargetable
compiler for that particular customized architecture.

A customized very long instruction word (VLIW) processor
for LR is implemented in [?]. We take a different approach
and design a customized multiprocessor based on the transport
triggered architecture (TTA) paradigm. TTA is a processor
design philosophy in which the programmer controls the
internal data transports between the function units of the
processor. TTA exploits instruction-level parallelism (ILP)
by processing several instructions in a single clock cycle. The
TTA-based codesign environment (TCE) tool is used in this
work to design the TTA processor cores. TCE enables the
designer to write an application in a high-level language
and to design the target processor in a graphical user interface
at the same time. A turbo decoder and a MIMO detector designed
with TCE can be found in [?] and [10]. In this work, every
core of the proposed multiprocessor is programmed in C
to shorten the time to market. The multiprocessor
takes 187 cycles and achieves a maximum clock frequency of
210 MHz in 90 nm technology. To our knowledge, this is the
first TTA-based customized architecture for LR.

II. System Model
A. Conventional MIMO Detection

Consider a MIMO system with $M_T$ transmit antennas,
which send data over the channel, and $N_R$ receive antennas,
which receive the transmitted data from the channel. The
modulation scheme used here is quadrature amplitude
modulation (QAM), with the assumption $N_R \geq M_T$. The
received signal $\mathbf{y}$ can be represented as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{y} \in \mathbb{C}^{N_R}$ is the received signal vector,
$\mathbf{x} \in \mathbb{C}^{M_T}$ is the transmit symbol vector,
$\mathbf{H} \in \mathbb{C}^{N_R \times M_T}$ is the channel matrix and
$\mathbf{n} \in \mathbb{C}^{N_R}$ is the circularly symmetric complex
white Gaussian noise vector with zero mean and variance $\sigma^2$.

In the receiver, the linear zero forcing (ZF) detector applies
the pseudoinverse of the channel matrix to the received signal
to estimate the transmitted symbol vector, which can be
expressed as

$$\hat{\mathbf{x}} = (\mathbf{H}^{H}\mathbf{H})^{-1}\mathbf{H}^{H}\mathbf{y} = \mathbf{H}^{\dagger}\mathbf{y}, \tag{2}$$

where $\mathbf{H}$ is the channel matrix and $(\cdot)^{\dagger}$ denotes the
pseudoinverse. Typically, the channel matrix $\mathbf{H}$ is QR decomposed
into two parts as $\mathbf{H} = \mathbf{Q}\mathbf{R}$. Here $\mathbf{Q}$ denotes a unitary
matrix and $\mathbf{R}$ denotes an upper triangular matrix.
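
As a minimal sketch of (1) and (2), the following Python code applies ZF detection to a 2x2 complex channel. The channel, symbol and noise values, the 4-QAM slicer and all function names are illustrative assumptions and are not taken from this work.

```python
# Zero-forcing sketch for a 2x2 complex channel (all numbers are hypothetical).

def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))] for c in range(len(a[0]))]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def slice_4qam(v):
    """Nearest 4-QAM point with real and imaginary parts in {-1, +1}."""
    return complex(1 if v.real >= 0 else -1, 1 if v.imag >= 0 else -1)

def zf_detect(h, y):
    """x_hat = (H^H H)^{-1} H^H y as in (2), followed by slicing."""
    hh = hermitian(h)
    pinv = matmul(inv2(matmul(hh, h)), hh)
    x_soft = matmul(pinv, [[v] for v in y])
    return [slice_4qam(row[0]) for row in x_soft]

H = [[1 + 0.5j, 0.3 - 0.2j], [0.2 + 0.1j, 0.9 - 0.4j]]   # hypothetical channel
x = [1 + 1j, -1 + 1j]                                     # hypothetical symbols
n = [0.05 - 0.02j, -0.03 + 0.04j]                         # hypothetical noise
y = [sum(H[i][j] * x[j] for j in range(2)) + n[i] for i in range(2)]   # model (1)
x_hat = zf_detect(H, y)
```

With this well-conditioned channel and small noise, the sliced estimate matches the transmitted vector. For a poorly conditioned channel the same detector amplifies the noise, which is the case that LR is meant to improve.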

B. Lattice Reduction 

A lattice is a periodic arrangement of discrete points. A
lattice can be characterized by a set of basis vectors,
where any point of the lattice can be represented as a
superposition of integer multiples of the basis vectors. For
simplicity, we call the set $\mathbf{B} = (\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n)$ the basis of
the lattice.

A complex-valued lattice in the n-dimensional complex
space $\mathbb{C}^n$ can be defined as

$$\mathcal{L} = \{\mathbf{v} \mid \mathbf{v} = \mathbf{B}\boldsymbol{\omega}\}, \tag{3}$$

where $\mathbf{B}$ is the basis of the lattice and $\boldsymbol{\omega} = [\omega_1, \omega_2, \ldots, \omega_n]$.
Note that in (3), $\mathbf{v}$, $\boldsymbol{\omega}$ and the matrix $\mathbf{B}$ can be replaced with
$\mathbf{y}$, $\mathbf{x}$ and $\mathbf{H}$, respectively, to obtain $\mathcal{L} = \{\mathbf{y} \mid \mathbf{y} = \mathbf{H}\mathbf{x}\}$. In this
case, the lattice $\mathcal{L}$ is the set of all possible undisturbed
received signal points. Many different bases span the same
lattice $\mathcal{L}$, and the aim of the LR algorithm is to find the
least correlated basis with the shortest basis vectors [?].
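
The following Python sketch illustrates, for a real-valued two-dimensional special case, that two different bases can generate the same lattice while differing in how correlated their basis vectors are. The bases, the function names and the use of the orthogonality defect as a correlation measure are illustrative assumptions.

```python
import math
from itertools import product

def points(basis, coeff_range):
    """Lattice points B*w for integer coefficient pairs w (columns of B are the basis vectors)."""
    (b11, b12), (b21, b22) = basis
    return {(b11 * w1 + b12 * w2, b21 * w1 + b22 * w2)
            for w1, w2 in product(coeff_range, repeat=2)}

def orthogonality_defect(basis):
    """Product of the column norms divided by |det B|; equals 1 for an orthogonal basis."""
    (b11, b12), (b21, b22) = basis
    det = abs(b11 * b22 - b12 * b21)
    return math.hypot(b11, b21) * math.hypot(b12, b22) / det

B_long = [[1, 3], [0, 1]]    # hypothetical basis with strongly correlated columns
B_short = [[1, 0], [0, 1]]   # B_long multiplied by the unimodular matrix [[1, -3], [0, 1]]

# Every point generated by one basis in a small window is also generated by the other.
same_lattice = (points(B_short, range(-2, 3)) <= points(B_long, range(-8, 9))
                and points(B_long, range(-2, 3)) <= points(B_short, range(-8, 9)))
```

Here `orthogonality_defect(B_long)` is about 3.16, whereas the reduced basis reaches the minimum value of 1.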

C. LR-based MIMO Detection 

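LR finds an improved basis for the lattice induced by the
channel. The original basis and the reduced basis are related
by a unimodular matrix $\mathbf{T}$. Therefore, LR-aided detection
detects the symbol in the reduced basis and
afterwards transforms the result back to the original lattice.
The new channel matrix after LR can be expressed as $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$,
and the transmitted signal is treated as multiplied by $\mathbf{T}^{-1}$,
that is, $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$ for the reduced basis. The received signal
$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$ can be expressed as

$$\mathbf{y} = \mathbf{H}\mathbf{T}\mathbf{T}^{-1}\mathbf{x} + \mathbf{n} = \tilde{\mathbf{H}}\mathbf{z} + \mathbf{n}. \tag{4}$$

LR-aided detection operates on $\tilde{\mathbf{H}}$ and $\mathbf{z}$ instead of $\mathbf{H}$
and $\mathbf{x}$. The LR-aided ZF detector can be expressed as

$$\hat{\mathbf{z}} = (\tilde{\mathbf{H}}^{H}\tilde{\mathbf{H}})^{-1}\tilde{\mathbf{H}}^{H}\mathbf{y} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y}. \tag{5}$$

The LR algorithm is applied to the QR-decomposed $\mathbf{H}$ to
obtain the modified $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$. Afterwards, the lattice-reduced
channel matrix can be obtained as $\tilde{\mathbf{H}} = \tilde{\mathbf{Q}}\tilde{\mathbf{R}}$.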
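
The change of basis in (4) and (5) can be sketched in Python as follows. The example is noise-free and uses a square channel, so the pseudoinverse reduces to the ordinary inverse. The channel, the unimodular matrix, the Gaussian-integer symbols and the function names are all hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

H = [[1 + 0j, 0.9 + 0.1j], [0.1j, 0.3 + 0j]]   # hypothetical channel
T = [[1, -1], [0, 1]]                            # hypothetical unimodular matrix, det = 1
H_red = matmul(H, T)                             # reduced channel: H T
x = [[2 + 1j], [-1 + 3j]]                        # hypothetical Gaussian-integer symbols
z = matmul(inv2(T), x)                           # symbols in the reduced basis: T^{-1} x
y = matmul(H, x)                                 # noise-free received vector

# Detect in the reduced basis, round to the integer grid, then map back with x = T z.
z_soft = matmul(inv2(H_red), y)
z_hat = [[complex(round(v[0].real), round(v[0].imag))] for v in z_soft]
x_hat = matmul(T, z_hat)
```

Because `T` has integer entries and determinant 1, `z` stays on the integer grid, so rounding in the reduced basis is meaningful and `x_hat` recovers `x`.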


III. Lattice Reduction Algorithm 

The LLL algorithm is widely used to compute a suitable
unimodular matrix $\mathbf{T}$ and to obtain a reduced lattice basis.
LLL was originally proposed for real-valued LR [1].
However, the channel matrix is naturally complex-valued, and
therefore the complex version of LLL (CLLL) is used to reduce
the complexity.

The CLLL algorithm suffers from an irregular dataflow,
which eventually leads to a higher latency. Therefore, a
fixed-complexity LLL (fcLLL) algorithm is proposed in [?]. The
fcLLL alters the signal flow of the CLLL to follow a
deterministic structure. It is possible to use the less complex
Siegel condition instead of the Lovász condition [?]. It is
also very important to use an early termination mechanism to
meet strict latency requirements. Applying all these
modifications, we propose a modified LLL (MLLL) algorithm for LR
with lower complexity and negligible BER performance loss. The
MLLL implemented in this paper is summarized in Algorithm 1.


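Algorithm 1 Modified CLLL Algorithm (MLLL)

INPUT: $\mathbf{Q}$, $\mathbf{R}$
Initialization: $\tilde{\mathbf{Q}} := \mathbf{Q}$, $\tilde{\mathbf{R}} := \mathbf{R}$, $\mathbf{T} := \mathbf{I}$
$k := 2$
while $k <$ iterations
  for $l = k-1$ to $1$ step $-1$
    $\mu = \lceil \tilde{\mathbf{R}}(l,k) / \tilde{\mathbf{R}}(l,l) \rfloor$
    if $\mu \neq 0$
      $\tilde{\mathbf{R}}(1:l,k) := \tilde{\mathbf{R}}(1:l,k) - \mu\,\tilde{\mathbf{R}}(1:l,l)$
      $\mathbf{T}(:,k) := \mathbf{T}(:,k) - \mu\,\mathbf{T}(:,l)$
    end
  end
  if $\delta\,|\tilde{\mathbf{R}}(k-1,k-1)|^2 > |\tilde{\mathbf{R}}(k,k)|^2$
    Swap columns $k-1$ and $k$ in $\tilde{\mathbf{R}}$ and $\mathbf{T}$
    $\boldsymbol{\Theta} = \begin{bmatrix} \alpha^{*} & \beta^{*} \\ -\beta & \alpha \end{bmatrix}$ with $\alpha = \dfrac{\tilde{\mathbf{R}}(k-1,k-1)}{\|\tilde{\mathbf{R}}(k-1:k,\,k-1)\|}$ and $\beta = \dfrac{\tilde{\mathbf{R}}(k,k-1)}{\|\tilde{\mathbf{R}}(k-1:k,\,k-1)\|}$
    $\tilde{\mathbf{R}}(k-1:k,\,k-1:M_T) := \boldsymbol{\Theta}\,\tilde{\mathbf{R}}(k-1:k,\,k-1:M_T)$
    $\tilde{\mathbf{Q}}(:,\,k-1:k) := \tilde{\mathbf{Q}}(:,\,k-1:k)\,\boldsymbol{\Theta}^{H}$
    $k := \max\{k-1, 2\}$
  else
    $k := k+1$
  end
end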
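
To make the steps of Algorithm 1 concrete, the following Python sketch implements size reduction, the Siegel test with a factor of 0.75, the column swap and the Givens rotation for a square complex matrix. It is an illustration under stated assumptions and not the C code that runs on the TTA cores: the termination rule (a cap `max_steps` on the number of outer loop passes), the Gram-Schmidt helper, the example matrix and all names are hypothetical.

```python
import math

def qr_gram_schmidt(H):
    """QR decomposition by modified Gram-Schmidt (helper for this example only)."""
    n = len(H)
    q_cols, R = [], [[0j] * n for _ in range(n)]
    for j in range(n):
        v = [H[i][j] for i in range(n)]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qc.conjugate() * vc for qc, vc in zip(q, v))
            v = [vc - R[i][j] * qc for vc, qc in zip(v, q)]
        norm = math.sqrt(sum(abs(vc) ** 2 for vc in v))
        R[j][j] = complex(norm)
        q_cols.append([vc / norm for vc in v])
    return [[q_cols[j][i] for j in range(n)] for i in range(n)], R

def mlll(Q, R, max_steps=5, delta=0.75):
    """Sketch of Algorithm 1 with 0-based indices; returns the reduced (Q, R, T)."""
    n = len(R)
    Q = [row[:] for row in Q]
    R = [row[:] for row in R]
    T = [[complex(i == j) for j in range(n)] for i in range(n)]
    k, steps = 1, 0
    while k < n and steps < max_steps:      # assumed early-termination rule
        steps += 1
        for l in range(k - 1, -1, -1):      # size reduction of column k
            ratio = R[l][k] / R[l][l]
            mu = complex(round(ratio.real), round(ratio.imag))
            if mu != 0:
                for i in range(l + 1):
                    R[i][k] -= mu * R[i][l]
                for row in T:
                    row[k] -= mu * row[l]
        if delta * abs(R[k - 1][k - 1]) ** 2 > abs(R[k][k]) ** 2:   # Siegel test
            for M in (R, T):                # swap columns k-1 and k
                for row in M:
                    row[k - 1], row[k] = row[k], row[k - 1]
            a, b = R[k - 1][k - 1], R[k][k - 1]
            norm = math.hypot(abs(a), abs(b))
            alpha, beta = a / norm, b / norm
            theta = [[alpha.conjugate(), beta.conjugate()], [-beta, alpha]]
            for j in range(k - 1, n):       # rows k-1 and k of R := Theta * R
                u, v = R[k - 1][j], R[k][j]
                R[k - 1][j] = theta[0][0] * u + theta[0][1] * v
                R[k][j] = theta[1][0] * u + theta[1][1] * v
            for row in Q:                   # columns k-1 and k of Q := Q * Theta^H
                u, v = row[k - 1], row[k]
                row[k - 1] = u * theta[0][0].conjugate() + v * theta[0][1].conjugate()
                row[k] = u * theta[1][0].conjugate() + v * theta[1][1].conjugate()
            k = max(k - 1, 1)
        else:
            k += 1
    return Q, R, T

H = [[1 + 0j, 0.9 + 0.1j, 0.5 - 0.2j],      # hypothetical 3x3 channel
     [0.1j, 0.3 + 0j, 0.4 + 0.4j],
     [0.2 + 0j, -0.1j, 0.1 + 0j]]
Q0, R0 = qr_gram_schmidt(H)
Q_red, R_red, T = mlll(Q0, R0)
```

Each step preserves the product: multiplying `Q_red` by `R_red` gives `H` times `T` up to rounding error, so stopping early leaves a consistent, though possibly less reduced, basis.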


The BER performance of the traditional ZF, the original
CLLL-aided ZF, the MLLL-aided ZF and the optimal ML detector is
simulated for various signal-to-noise ratios (SNR) in Matlab. An
additive white Gaussian noise (AWGN) channel is used with
16-QAM modulation, and the BER is averaged over 10 000
Monte Carlo trials. Fig. 1 shows the results for the MLLL
algorithm with 5 iterations. The performance loss is negligible
compared to the original CLLL algorithm.

Fig. 1. BER performance of the MLLL algorithm.


IV. TTA Multiprocessor for MLLL 
A. Special Function Units 

Six special function units (SFUs) are designed and written in
VHSIC hardware description language (VHDL) to accelerate
each iteration of the MLLL algorithm. One SFU
supports the complex multiplication (CMUL)
operation. Data-level parallelism (DLP) is applied in the design
by packing the 16-bit real part and the 16-bit imaginary part
into a 32-bit complex variable. CMUL therefore uses four 16-bit
multipliers, a single 16-bit adder and a single 16-bit
subtractor to perform the complex multiplication.
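
A behavioural sketch of such a packed complex multiplication is given below in Python. The placement of the real part in the upper half of the word, the fixed-point format with 12 fractional bits and the absence of saturation are assumptions made for illustration; the text does not specify them.

```python
FRAC_BITS = 12   # hypothetical Q-format; not stated in the text

def pack(re, im):
    """Pack two signed 16-bit values into one 32-bit word (real part in the upper half)."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def unpack(word):
    """Split a 32-bit word into signed 16-bit real and imaginary parts."""
    def signed16(v):
        return v - 0x10000 if v & 0x8000 else v
    return signed16((word >> 16) & 0xFFFF), signed16(word & 0xFFFF)

def cmul(a_word, b_word):
    """Complex product using four multiplications, one subtraction and one addition."""
    ar, ai = unpack(a_word)
    br, bi = unpack(b_word)
    # Four multiplications, each product scaled back to the 16-bit format.
    p_rr = (ar * br) >> FRAC_BITS
    p_ii = (ai * bi) >> FRAC_BITS
    p_ri = (ar * bi) >> FRAC_BITS
    p_ir = (ai * br) >> FRAC_BITS
    re = p_rr - p_ii          # one subtraction
    im = p_ri + p_ir          # one addition
    return pack(re, im)       # no saturation handling in this sketch

# (1.5 + 0.5j) * (2 - 1j) = 3.5 - 0.5j with 12 fractional bits
a = pack(6144, 2048)
b = pack(8192, -4096)
product = unpack(cmul(a, b))
```

Packing both parts into one word lets a single 32-bit register and a single bus transport carry a whole complex operand.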

Two SFUs for the $\mu$ calculation and the size reduction are
designed according to [?]. These SFUs are single-cycle and
multiplierless. The Matlab simulations show that the value
of $\mu$ lies in the range $[-4, 4]$, and the dynamic range of
the SFUs is set accordingly. Another simple SFU
computes the Siegel criterion. Instead of multiplying the
input by 0.75, the SIEGEL SFU calculates the value with a
combination of two shifters and one adder.
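
The shift-and-add evaluation of the factor 0.75 can be sketched as follows. The sketch assumes non-negative integer fixed-point inputs, and the assumption that the SFU compares squared diagonal entries is ours; the function names are hypothetical.

```python
def three_quarters(x):
    """0.75 * x as (x >> 1) + (x >> 2): two shifts and one addition, no multiplier."""
    return (x >> 1) + (x >> 2)

def siegel_swap(r_prev_sq, r_curr_sq):
    """True when 0.75 * r_prev_sq > r_curr_sq, i.e. when the columns should be swapped."""
    return three_quarters(r_prev_sq) > r_curr_sq
```

For example, `three_quarters(1024)` gives 768. Each shift truncates, so for inputs that are not multiples of four the result can be slightly below the exact product, e.g. `three_quarters(7)` gives 4 instead of 5.25.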




Fig. 2. 4-cycle CORDIC architecture.


The most complex SFU in the datapath of a single
TTA core is the CORDIC SFU. A master-slave CORDIC is
used in this work [?]. The master-slave CORDIC is
a combination of two CORDIC blocks, one in vectoring mode
and one in rotation mode. By setting the inputs of the
rotation-mode CORDIC to 1 and 0, it is possible to
calculate the cosine and sine values directly. Therefore, the
angle calculation done in a conventional CORDIC block is
not needed here. In every stage, the signum values are
calculated and the rotation-mode block adds or subtracts
accordingly. As we need a 16-bit CORDIC, there are two
options to design it. The first is an iterative CORDIC that
uses registers and iterates 16 times over a 1-stage datapath.
However, it takes 16 clock cycles to compute the output. For a
processor-based implementation, a 16-cycle SFU is costly, as
there will be 15 NOP operations in the assembly code. The
second is to fully unroll the CORDIC block without any
registers, but then the critical path of the CORDIC block
becomes too long. We find a compromise between the two
approaches and design a 4-stage CORDIC datapath that is reused
four times to create a 4-cycle master-slave CORDIC. The block
diagram of the master-slave CORDIC is presented in Fig. 2,
where the ellipse contains a single stage of the datapath. An
ARRANGE SFU is designed to rearrange the 32-bit variables.
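
The master-slave principle can be sketched in Python as follows. The master runs in vectoring mode and its rotation directions steer a slave that starts from (1, 0), so that cosine and sine are obtained without computing the angle. The sketch uses floating point instead of 16-bit fixed point, handles only real inputs with a positive first component, and compensates the CORDIC gain by a final division; these choices and all names are assumptions for illustration.

```python
import math

STAGES_PER_CYCLE = 4   # micro-rotations evaluated per clock cycle
CYCLES = 4             # passes through the datapath, 16 micro-rotations in total
GAIN = math.prod(math.sqrt(1 + 2.0 ** (-2 * i))
                 for i in range(STAGES_PER_CYCLE * CYCLES))

def master_slave_cordic(x, y):
    """Return (magnitude, cos, sin) for the vector (x, y), assuming x > 0."""
    xs, ys = 1.0, 0.0                      # slave input fixed to (1, 0)
    for cycle in range(CYCLES):
        for stage in range(STAGES_PER_CYCLE):
            i = cycle * STAGES_PER_CYCLE + stage
            shift = 2.0 ** -i
            d = -1 if y >= 0 else 1        # master (vectoring): drive y toward zero
            x, y = x - d * y * shift, y + d * x * shift
            # Slave (rotation): apply the opposite micro-rotation to (xs, ys).
            xs, ys = xs + d * ys * shift, ys - d * xs * shift
    return x / GAIN, xs / GAIN, ys / GAIN

magnitude, cos_val, sin_val = master_slave_cordic(3.0, 4.0)
```

The nested loops mirror the 4-stage datapath that is reused in four clock cycles. For the input (3, 4) the result is close to a magnitude of 5 with cosine 0.6 and sine 0.8.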

B. High-Level Architecture of the Multiprocessor

A 32-bit fixed-point TTA processor is designed to support a
single iteration of the MLLL algorithm, and five of these TTA
cores are connected in a pipelined fashion to compute the
lattice-reduced matrix. Part of a single TTA processor core is
illustrated inside the dotted block of Fig. 3. For readability,
the whole processor is not shown. The blocks in the upper part
of the core represent the function units and register files of
the processor. The black horizontal straight lines represent
the buses of the processor. The vertical rectangular blocks
represent the sockets.

Each core includes a load/store unit (LSU), an arithmetic
logic unit (ALU), a global control unit (GCU), register files,
several conventional function units and the SFUs. The Q, R
and T matrices are read from three separate first-in-first-out
(FIFO) memory buffers by function units called
STREAM. A STREAM unit can read one input sample
per clock cycle. Three STREAM units are used to read the
inputs simultaneously. Three STREAM units are used to write
the outputs to the memory buffers.

Ten register files are used to store the intermediate results. A
single Boolean register file is included in the processor design.
When the registers are not sufficient, the processor can
access the data memory through the LSU to store data
temporarily. The SFUs can be invoked through macros to
accelerate the program code.
Eight buses are used in a single core, and therefore the core
can execute up to eight data transports (move instructions) in
a single clock cycle. As the datapath is exposed to the
programmer in the TTA architecture, it is possible to remove
unused or rarely used connections between the function units
and the buses. Thus, several connections of the processor are
removed to reduce the cost of a core. The connections between
the function units and the buses are illustrated by black
spots in the sockets of Fig. 3.

The cores are connected to one another by FIFO buffer
memories. In this way, five iterations of the MLLL can be
processed in parallel. The cores are identical and, except for
the first core, all cores use the same program image.

V. Results and Discussion 

The TCE tool shows that the TTA multiprocessor
takes 187 cycles to compute the MLLL algorithm. Some of
the operations executed during a single iteration are presented
in Table I. Conventional operations such as additions and
shifts are not shown in the table.


TABLE I
Number of operations

Operation         No. of Ops
ARRANGE           18
CORDIC            9
CMUL              72
STREAM            114
SIEGEL            3
SIZE REDUCTION    34


The multiprocessor is synthesized using the UMC 90 nm
standard cell library (fsdOk_generic_coreldOvtc). Synopsys
Design Compiler is used to estimate the gate count and the
maximum achievable clock frequency. The operating conditions
(temperature, operating voltage, manufacturing process quality)
for synthesis are set to their default values (TCCOM). The
maximum clock frequency achieved in synthesis for the
multiprocessor is 210 MHz. The total gate count of the
multiprocessor at 210 MHz is around 405 kgates.

A comparison with other implementations of LR
is presented in Table II. Two important VLSI architectures
for the LR algorithm with low latency and area can be found in
[4] and [6]. The authors implemented the reverse-Siegel
LLL (RS-LLL) and the hardware-optimized LLL (HOLLL) in [4]
and [6], respectively. A VLSI architecture for Clarkson's
algorithm is provided in [5]. That architecture provides lower
throughput than our architecture despite its hardwired
control path. The latency of the VLSI architecture of [7] is
the lowest, but at the price of a very low maximum achievable
clock frequency of 37 MHz. Although most of the VLSI
implementations take fewer cycles and less area, these
architectures are inflexible, and as a consequence later field
updates are not possible.

As different variants of the LLL algorithm have been proposed
in the literature, a flexible implementation is a necessity. Our
customized multiprocessor is an example of such a flexible
implementation with moderate latency and cost. Different
variants of the LLL algorithm can be supported by updating
the instruction memory with a new binary program image. The
updated binary instructions can be obtained by compiling the
other LLL algorithms for our particular architecture with
a retargetable compiler.

The programmable VLIW core [?] takes fewer clock cycles
and is flexible. That implementation covers not only LR
but also QR decomposition and detection. However, it is
not clear how much area is needed for LR alone. The total
area is very high compared to the other implementations, even
in 40 nm technology.

TABLE II
Implementation comparison

Reference   Architecture/tech.   Area        Max. freq.   Cycles
[4]         0.13 µm              107 kGE     333 MHz      14
[5]         Virtex-II Pro        N/A         100 MHz      420
[5]         0.13 µm              125 kGE     352 MHz      40
[7]         90 nm                200 kGE     37 MHz       5
[?]         VLIW (40 nm)         6364 kGE    700 MHz      21
Proposed    TTA (90 nm)          405 kGE     210 MHz      187
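
Cycle counts and clock frequencies can be combined into a processing time per matrix as cycles divided by frequency. The Python sketch below does this for the rows of Table II whose values are clearly legible. It assumes that each cycle count refers to one matrix and that each design runs at its maximum clock frequency; matrix sizes and algorithms differ between the references, so the numbers are only a rough guide.

```python
# (cycles, maximum clock frequency in MHz) taken from Table II
entries = {
    "Virtex-II Pro": (420, 100),
    "90 nm VLSI": (5, 37),
    "VLIW (40 nm)": (21, 700),
    "Proposed TTA (90 nm)": (187, 210),
}

# cycles / MHz gives microseconds; multiply by 1000 to obtain nanoseconds
latency_ns = {name: 1e3 * cycles / mhz for name, (cycles, mhz) in entries.items()}
```

Under these assumptions the proposed multiprocessor needs about 890 ns per matrix, compared with 4200 ns for the Virtex-II Pro design and about 135 ns for the 37 MHz VLSI design.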


VI. Conclusion 

We propose a modified LLL algorithm for LR. We simulate
the performance of the algorithm in Matlab and propose a
customized multiprocessor architecture for the MLLL. The cores
are programmable with the help of a retargetable compiler. The
flexible implementation shows great promise for supporting later
field updates and provides high throughput at a moderate
cost.